Introduction to Triton Programming: From Threads to Program Instances

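In Triton, the basic unit of execution shifts from the scalar CUDA thread to the program instance. A program instance is an abstraction roughly comparable to a GPU thread block: a single instance processes an entire vectorized "block" of elements at once.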
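To make the contrast concrete, here is a plain-Python emulation (no GPU and no Triton involved) of a vector addition written both ways. The function names and the default block size of 128 are assumptions chosen for illustration. The loops only stand in for parallel execution units, so this is a sketch of the counting, not of real GPU scheduling.

```python
def cuda_style_add(x, y):
    """One 'thread' per element: each execution unit handles exactly one scalar."""
    out = [0.0] * len(x)
    for tid in range(len(x)):          # each iteration stands in for one CUDA thread
        out[tid] = x[tid] + y[tid]
    return out, len(x)                 # number of execution units used


def triton_style_add(x, y, block_size=128):
    """One 'program instance' per block: each unit handles a slice of elements."""
    n = len(x)
    num_programs = (n + block_size - 1) // block_size   # ceiling division
    out = [0.0] * n
    for pid in range(num_programs):    # each iteration stands in for one program instance
        start = pid * block_size
        stop = min(start + block_size, n)
        # the whole block is handled as a single vector operation
        out[start:stop] = [a + b for a, b in zip(x[start:stop], y[start:stop])]
    return out, num_programs
```

For 1,000 elements both versions produce the same result, but the scalar model uses 1,000 execution units while the block model uses only 8 program instances (1000 / 128 rounded up).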

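Each execution unit obtains its identity with pid = tl.program_id(axis=0). Picture a warehouse forklift (a program instance) moving a whole pallet (a block) of 128 boxes, in contrast to a single worker (a CUDA thread) carrying just one box. The pid plays the role of the aisle number: it tells each forklift which pallet is its own.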
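The sketch below emulates, in plain Python, the offset arithmetic a Triton kernel commonly derives from pid. In an actual kernel this is usually written with tl.program_id, tl.arange and a mask passed to tl.load/tl.store, and the grid size is often computed with triton.cdiv; here pid is simply a function argument and nothing calls Triton. The helper names block_offsets and grid_size are hypothetical.

```python
BLOCK_SIZE = 128  # boxes per pallet, matching the forklift analogy


def grid_size(n_elements, block_size=BLOCK_SIZE):
    """Number of program instances (forklifts) needed to cover every element."""
    return (n_elements + block_size - 1) // block_size


def block_offsets(pid, n_elements, block_size=BLOCK_SIZE):
    """Return (offsets, mask) for the block owned by program instance `pid`."""
    # emulates: pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    offsets = [pid * block_size + i for i in range(block_size)]
    # emulates: offsets < n_elements, which guards the ragged last block
    mask = [off < n_elements for off in offsets]
    return offsets, mask
```

With 300 elements, three program instances are launched; instance 2 owns offsets 256 to 383, of which only the first 44 pass the mask.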

Understanding the semantic difference between these two views is essential for memory management:

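PyTorch view
A Python object on the host that holds metadata (shape, strides, data pointer) describing data stored in GPU global memory.

Triton view
A 1D or 2D block of data held in registers allocated by the compiler.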
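As a rough illustration of the difference, the following plain-Python sketch models global memory as a flat list, a host-side view as pure metadata (offset, shape, strides), and a loaded block as a separate local 2D list standing in for registers. HostView and load_block are hypothetical names; HostView is not PyTorch's tensor class, only a model of the metadata idea.

```python
from dataclasses import dataclass


@dataclass
class HostView:
    """Host-side metadata only: where the data lives and how to step through it."""
    storage_offset: int
    shape: tuple
    strides: tuple  # measured in elements

    def address(self, i, j):
        # pointer arithmetic: base + row * stride0 + col * stride1
        return self.storage_offset + i * self.strides[0] + j * self.strides[1]


def load_block(memory, view, row0, col0, block_m, block_n):
    """Gather a block_m x block_n tile into a local 2D list (the 'register' block)."""
    return [[memory[view.address(row0 + i, col0 + j)] for j in range(block_n)]
            for i in range(block_m)]
```

With memory = list(range(12)) interpreted as a 3x4 row-major matrix, a transposed view is just different metadata, HostView(0, (4, 3), (1, 4)), over the same memory; loading a 2x2 block at (0, 0) from it yields [[0, 4], [1, 5]]. Modifying that loaded block does not touch memory, since it is a copy.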

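Triton follows a Single Program, Multiple Data (SPMD) model. Every program instance executes exactly the same kernel code; instances diverge only when the logic uses pid to compute instance-specific memory offsets.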
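The following plain-Python sketch emulates SPMD execution by running the same kernel body once per pid. Both kernel functions and the tiny BLOCK_SIZE of 4 are hypothetical. The sequential loop hides the race: on a GPU, the instances of the buggy kernel would write the same addresses concurrently.

```python
BLOCK_SIZE = 4


def kernel_without_pid(pid, src, dst):
    # BUG: offsets ignore pid, so every instance reads and writes block 0
    # and the rest of dst is never written.
    for i in range(BLOCK_SIZE):
        dst[i] = src[i] * 2


def kernel_with_pid(pid, src, dst):
    # pid selects a distinct slice, so instances touch disjoint memory.
    for i in range(BLOCK_SIZE):
        off = pid * BLOCK_SIZE + i
        if off < len(src):  # mask for the last, partial block
            dst[off] = src[off] * 2


def launch(kernel, src):
    dst = [0] * len(src)
    num_programs = (len(src) + BLOCK_SIZE - 1) // BLOCK_SIZE
    for pid in range(num_programs):  # same code, different pid
        kernel(pid, src, dst)
    return dst
```

For src = list(range(10)), the pid-aware kernel doubles every element, while the buggy one leaves [0, 2, 4, 6, 0, 0, 0, 0, 0, 0]: three instances redundantly overwrite the first block and the rest is never touched.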

QUESTION 1

What is the primary identifier for a Triton execution unit?

threadIdx.x

tl.program_id(axis=0)

tl.block_idx()

torch.get_id()

QUESTION 2

True or False: A Triton tensor is a Python object that stores metadata like strides on the host CPU.

True

False

QUESTION 3

What can happen if you forget that all program instances execute the same kernel body?

The compiler will automatically distribute tasks.

Race conditions or overwriting memory if pid-based logic is missing.

The kernel will fail to compile due to a syntax error.

Execution time will double.

QUESTION 4

In the forklift analogy, what does the 'Aisle Number' represent?

The BLOCK_SIZE

The program_id (pid)

The GPU Driver version

The Pointer address

QUESTION 5

Why is the Triton model considered 'Vectorized' compared to CUDA?

It uses Python lists.

One Program Instance handles a block of elements, not just one scalar element.

It only works with 2D matrices.

It runs on the CPU's SIMD units.